---
title: "Global video games sales"
output: html_notebook
---


# Problem:
We live in a time when video games are extremely popular. The global video game market continues to grow year over year, and the industry is now valued at over $100 billion worldwide. As technology keeps pushing the boundaries, video games have become both more popular and higher in quality. Gameplay mechanics, cutting-edge graphics, and intricate storylines make today's games more immersive than ever before. We chose this dataset to gain insight into the popularity of different gaming platforms and the most successful genres associated with those platforms.

# Data Mining Task:
Our data mining task is to predict the popularity of upcoming games using regression.

# Description of the dataset:

The dataset, provided by vgchartz.com, gives us a valuable resource for exploring the platforms and genres of the top 16599 video games worldwide. Through it, we can analyze the most popular platforms and genres that influence global sales, and detect how regional sales affect global sales.

# Our goal:

Our goal in studying this dataset is to apply classification and clustering techniques to the input data to make predictions about the popularity of upcoming games.

# Source and link:
Source: Kaggle

URL link: https://www.kaggle.com/datasets/gregorut/videogamesales




# Attributes description:


| **Attribute name** | **Description** | **Data type** |
|--------------------|-----------------|---------------|
| Rank | Ranking of the game based on global sales. | Numeric |
| Name | Name of the game. | Nominal |
| Platform | Platform the game was released on. | Nominal |
| Year | Year the game was released. | Ordinal |
| Genre | Genre of the game. | Nominal |
| Publisher | Publisher of the game. | Nominal |
| NA_Sales | Sales of the game in North America. | Numeric (ratio-scaled) |
| EU_Sales | Sales of the game in Europe. | Numeric (ratio-scaled) |
| JP_Sales | Sales of the game in Japan. | Numeric (ratio-scaled) |
| Other_Sales | Sales of the game in other regions. | Numeric (ratio-scaled) |
| Global_Sales | Total sales of the game worldwide. | Numeric (ratio-scaled) |


# Class label:

'Popular' is our class label. We will use the Global_Sales attribute to predict whether a game will sell 1,000,000 or more copies globally.
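
A minimal sketch of how such a label could be derived, assuming Global_Sales is recorded in millions of units (as in the Kaggle copy of this dataset), so that 1,000,000 copies corresponds to a value of 1. The `toy_label` data frame and its values are made up for illustration. Because the threshold is expressed in the original units, it would have to be applied before Global_Sales is rescaled.

```{r}
# Hypothetical toy data: three games with global sales in millions of units
toy_label <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Global_Sales = c(0.45, 1.00, 82.74)
)

# Assumption: a game is "popular" when it sold at least 1 million copies
toy_label$Popular <- factor(
  ifelse(toy_label$Global_Sales >= 1, "yes", "no"),
  levels = c("no", "yes")
)

table(toy_label$Popular)
```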





# Loading the libraries needed for our data mining tasks:
```{r}
library(outliers) 
library(dplyr)
library(Hmisc)
library(ggplot2)
library(mlbench)
library(caret)
options(max.print=9999999)
```





# Importing our dataset:
```{r}
dataset=read.csv("Dataset/vgsales.csv")
```




# General info about our dataset:

Checking the number of rows and columns, the dimensions, and the column names:
```{r}
nrow(dataset)
ncol(dataset)
dim(dataset)
names(dataset)
```




Dataset structure:
```{r}
str(dataset)
```



Sample of the raw dataset (first 10 rows):
```{r}
head(dataset, 10)
```

Sample of the raw dataset (last 10 rows):
```{r}
tail(dataset, 10)
```

Summary of our dataset:
```{r}
summary(dataset)
```

Variance of the numeric sales attributes:
```{r}
var(dataset$NA_Sales)
var(dataset$EU_Sales)
var(dataset$JP_Sales)
var(dataset$Other_Sales)
var(dataset$Global_Sales)
```
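
The per-column variance calls can also be written as a single call over all sales columns. This is only a sketch on a made-up data frame (`toy_var` is a hypothetical name); on the real data the same pattern would be applied to the five sales columns.

```{r}
# Hypothetical toy data with two sales columns
toy_var <- data.frame(
  NA_Sales = c(1, 2, 3),
  EU_Sales = c(2, 4, 6)
)

# sapply() applies var() to each column and returns a named vector
sales_variance <- sapply(toy_var, var)
sales_variance
```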






# Graphs:

```{r}
dataset2 <- dataset %>% sample_n(50)
tab <- dataset2$Platform %>% table()
precentages <- tab %>% prop.table() %>% round(3) * 100 
txt <- paste0(names(tab), '\n', precentages, '%') 

pie(tab, labels=txt , main = "Pie chart of Platform") 

```

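From the pie chart of the Platform attribute, PS appears to be the most common platform among the sampled games, which suggests that releasing a game for PS users may increase its popularity. Note that the chart is drawn from a random sample of 50 games, so the exact shares change from run to run.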
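
The percentages shown on the chart come from a frequency table turned into proportions. A small sketch of that calculation on made-up platform values (`toy_platform` is hypothetical), without any sampling so the result is the same on every run:

```{r}
# Hypothetical toy data: platform of six games
toy_platform <- c("PS2", "DS", "PS2", "Wii", "DS", "PS2")

# Count each platform, convert counts to proportions, then to percentages
platform_share <- round(prop.table(table(toy_platform)) * 100, 1)

# Largest share first
sort(platform_share, decreasing = TRUE)
```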





```{r}
# coloring barplot and adding text
tab<-dataset$Genre %>% table() 

precentages<-tab %>% prop.table() %>% round(3)*100 

txt<-paste0(names(tab), '\n',precentages,'%') 

bb <- dataset$Genre %>% table() %>% barplot(axisnames=F, main = "Barplot for Popular genres ",ylab='count',col=c('pink','blue','lightblue','green','lightgreen','red','orange','red','grey','yellow','azure','olivedrab')) 

text(bb,tab/2,labels=txt,cex=1.5) 
```
In terms of genre, action games are the most common, followed by sports and miscellaneous (Misc) games. A plausible explanation is that so many games are released in these genres because of their popularity and sales.





```{r}
boxplot(dataset$NA_Sales, main = "BoxPlot for NA_Sales")
```
The boxplot of the NA_Sales attribute (sales of the game in North America) indicates that most values are close to each other, and that there are many outliers, since the dataset covers all North American sales of video games.

```{r}
boxplot(dataset$EU_Sales, main = "BoxPlot for EU_Sales")
```
The boxplot of the EU_Sales attribute (sales of the game in Europe) indicates that most values are close to each other, and that there are many outliers, since the dataset covers all European sales of video games.

```{r}
boxplot(dataset$JP_Sales, main = "BoxPlot for JP_Sales")
```
The boxplot of the JP_Sales attribute (sales of the game in Japan) indicates that most values are close to each other, and that there are many outliers, since the dataset covers all Japanese sales of video games.


```{r}
boxplot(dataset$Other_Sales, main = "BoxPlot for Other_Sales")
```  

The boxplot of the Other_Sales attribute (sales of the game in other regions) indicates that most values are close to each other, and that there are many outliers, since the dataset covers video game sales in all remaining regions.




```{r}
boxplot(dataset$Global_Sales , main="BoxPlot for Global_Sales")

```  
The boxplot of the Global_Sales attribute indicates that most values are close to each other, and that there are many outliers, since the dataset covers the global sales of video games.



```{r}
# qplot() is deprecated as of ggplot2 3.4.0, so the same plot is built with ggplot()
ggplot(dataset, aes(x = Global_Sales, y = Genre)) +
  geom_boxplot(fill = "yellow", width = 0.5) +
  ggtitle("BoxPlots for genre and Global_Sales")
```

In the boxplot we can see that the Global_Sales values of all genres are close to each other, but we notice one outlier with Global_Sales above 80, which is a game in the Sports genre.

```{r}
dataset$Year %>% table() %>% barplot( main = "Barplot for year")
```

The barplot shows that the number of video games released per year was low from 1980 until about 2000, after which it grew, reaching more than 1,200 games per year in the period up to 2012.


```{r}
pairs(~NA_Sales + EU_Sales + JP_Sales + Other_Sales + Global_Sales, data = dataset,
      main = "Sales Scatterplot")
```    
We used a scatterplot matrix to determine the type of correlation between the sales attributes; we can see that most pairs are positively correlated with each other.
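
The visual impression from the scatterplots can be backed by numbers with a correlation matrix. The sketch below uses a made-up data frame (`toy_cor` and its values are hypothetical, with Global_Sales taken as the sum of the regional columns purely for illustration); on the real data the same `cor()` call would be applied to the five sales columns.

```{r}
# Hypothetical toy data: regional sales for five games
toy_cor <- data.frame(
  NA_Sales = c(41.5, 29.1, 15.9, 15.8, 11.3),
  EU_Sales = c(29.0, 3.6, 12.9, 11.0, 8.9),
  JP_Sales = c(3.8, 6.8, 3.8, 3.3, 10.2)
)
# Assumption for this toy example only: global sales are the sum of the regions
toy_cor$Global_Sales <- rowSums(toy_cor)

# Pearson correlation between every pair of columns, rounded for readability
sales_cor <- round(cor(toy_cor), 2)
sales_cor
```

Values near 1 indicate a strong positive linear relationship, values near 0 indicate little linear relationship, and negative values indicate an inverse relationship.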
 
 
      
# Pre-processing:

# Variable transformation:
```{r}
dataset$Rank=as.character(dataset$Rank)
```
We transformed Rank from numeric to character because we will treat it as ordinal data.

# Null checking:
We checked for null values to find out how many there are, so that we can decide how to deal with them. Missing entries may appear either as `NA` or as the string "N/A", so each chunk checks both.
```{r}
sum(is.na(dataset$Rank))
NullRank<-dataset[dataset$Rank=="N/A",]
NullRank
```
Checking for nulls in Rank (there are none).
```{r}
sum(is.na(dataset$Name))
NullName<-dataset[dataset$Name=="N/A",]
NullName
```

Checking for nulls in Name (there are none).

```{r}
sum(is.na(dataset$Platform))
NullPlatform<-dataset[dataset$Platform=="N/A",]


```
Checking for nulls in Platform (there are none).

```{r}
sum(is.na(dataset$Year))
NullYear<-dataset[dataset$Year=="N/A",]
NullYear
```
Checking for nulls in Year.
We won't delete the rows with null values; we will leave them as a global constant because we still want their sales data.

```{r}
sum(is.na(dataset$Genre))
NullGenre<-dataset[dataset$Genre=="N/A",]
NullGenre
```
Checking for nulls in Genre (there are none).

```{r}
sum(is.na(dataset$Publisher))
NullPublisher<-dataset[dataset$Publisher=="N/A",]
NullPublisher
```
Checking for nulls in Publisher.
We won't delete the rows with null values; we will leave them as a global constant because we still want their sales data.

```{r}
sum(is.na(dataset$NA_Sales))
NullNA_Sales<-dataset[dataset$NA_Sales=="N/A",]
NullNA_Sales
```
Checking for nulls in NA_Sales (there are none).

```{r}
sum(is.na(dataset$EU_Sales))
NullEU_Sales<-dataset[dataset$EU_Sales=="N/A",]
NullEU_Sales
```
Checking for nulls in EU_Sales (there are none).

```{r}
sum(is.na(dataset$JP_Sales))
NullJP_Sales<-dataset[dataset$JP_Sales=="N/A",]
NullJP_Sales
```
Checking for nulls in JP_Sales (there are none).


```{r}
sum(is.na(dataset$Other_Sales))
NullOther_Sales<-dataset[dataset$Other_Sales=="N/A",]


```
There are no null values in Other_Sales.

```{r}
sum(is.na(dataset$Global_Sales))
NullGlobal_Sales<-dataset[dataset$Global_Sales=="N/A",]


```
There are no null values in Global_Sales.
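
The per-column checks above can be condensed into one pass over all columns. This is a sketch on a made-up data frame (`toy_missing` is hypothetical), counting both real `NA` values and the placeholder string "N/A":

```{r}
# Hypothetical toy data with both kinds of missing value
toy_missing <- data.frame(
  Year = c("2006", "N/A", "2009"),
  Publisher = c("Nintendo", "N/A", NA),
  NA_Sales = c(41.49, 0.10, 15.75),
  stringsAsFactors = FALSE
)

# For each column, count entries that are NA or equal to the string "N/A"
missing_counts <- sapply(toy_missing, function(col) sum(is.na(col) | col == "N/A"))
missing_counts
```

Reading the file with `read.csv(..., na.strings = "N/A")` would be another option, since it converts the placeholder to `NA` at import time; the notebook keeps the placeholder on purpose, so that is only mentioned as an alternative.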

# Encoding:
We will encode our categorical data since most machine learning algorithms work with numbers rather than text.

```{r}
# Map each platform name to an integer code (1 to 31), in the order listed
dataset$Platform=factor(dataset$Platform,
  levels=c("2600","3DO","3DS","DC","DS","GB","GBA","GC","GEN","GG","N64","NES","NG","PC","PCFX","PS","PS2","PS3","PS4","PSP","PSV","SAT","SCD","SNES","TG16","Wii","WiiU","WS","X360","XB","XOne"),
  labels=1:31)
```
The Platform column is encoded to facilitate our data mining task.

```{r}
# Map each genre name to an integer code (1 to 12), in the order listed
dataset$Genre=factor(dataset$Genre,
  levels=c("Action","Adventure","Fighting","Platform","Puzzle","Racing","Role-Playing","Shooter","Simulation","Sports","Strategy","Misc"),
  labels=1:12)
```
For the same reason, the Genre column is encoded to facilitate our data mining task.
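
As a sketch of what this kind of label encoding does, the example below encodes a made-up genre vector (`toy_genre` is hypothetical). Unlike the chunks above, it lets `factor()` pick the level order itself (sorted alphabetically), so the resulting codes differ from the hand-assigned ones.

```{r}
# Hypothetical toy data: genre of four games
toy_genre <- c("Sports", "Action", "Misc", "Action")

# factor() stores each distinct value as a level, sorted alphabetically here
genre_factor <- factor(toy_genre)

# as.integer() returns the position of each value among the levels
genre_codes <- as.integer(genre_factor)

levels(genre_factor)
genre_codes
```

Listing the levels explicitly, as done for Platform and Genre, keeps the codes stable even if a category is absent from the data.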

# Outliers:
Outliers can distort analyses and statistical models, making it difficult to detect a true effect. Therefore, we check each sales attribute for outliers and remove any that we find.

Outlier in NA_Sales:
```{r}
# outlier() flags the value with the largest difference from the sample mean
OutNA_Sales = outlier(dataset$NA_Sales, logical =TRUE)
sum(OutNA_Sales)
Find_outlier_NA = which(OutNA_Sales)
```

Outlier in EU_Sales:
```{r}
OutEU_Sales = outlier(dataset$EU_Sales, logical =TRUE)
sum(OutEU_Sales)
Find_outlier_EU = which(OutEU_Sales)
```

Outlier in JP_Sales:
```{r}
OutJP_Sales = outlier(dataset$JP_Sales, logical =TRUE)
sum(OutJP_Sales)
Find_outlier_JP = which(OutJP_Sales)
```

Outlier in Other_Sales:
```{r}
OutOS=outlier(dataset$Other_Sales, logical=TRUE)
sum(OutOS)
Find_outlier_OS=which(OutOS)
```

Outlier in Global_Sales:
```{r}
OutGS=outlier(dataset$Global_Sales, logical=TRUE)
sum(OutGS)
Find_outlier_GS=which(OutGS)
```

# Remove outliers:
```{r}
# Combine the rows flagged for each sales attribute, so that no index is lost
# by overwriting, and drop them all at once
Find_outlier = unique(c(Find_outlier_NA, Find_outlier_EU, Find_outlier_JP, Find_outlier_OS, Find_outlier_GS))
dataset= dataset[-Find_outlier,]
```
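
The boxplots in the Graphs section mark outliers with a different rule from `outlier()`: points lying more than 1.5 times the interquartile range beyond the box. As a sketch of that rule, not the method used in this notebook, base R's `boxplot.stats()` returns those points directly. The `toy_values` vector is made up for illustration.

```{r}
# Hypothetical toy data: six small sales values and one extreme value
toy_values <- c(0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 41.5)

# $out holds the points beyond 1.5 * IQR from the hinges (the boxplot rule)
iqr_outliers <- boxplot.stats(toy_values)$out
iqr_outliers
```

On heavily skewed sales data this rule flags many rows, whereas `outlier()` flags only the most extreme value per attribute, so the two approaches lead to very different amounts of removed data.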



# Normalization:
Normalizing the data can improve the performance of many machine learning algorithms by accounting for differences in the scale of the input features.

Dataset before normalization:
```{r}
datsetWithoutNormalization<-dataset
```


```{r}
normalize <- function(x) {return ((x - min(x)) / (max(x) - min(x)))}
dataset$NA_Sales<-normalize(datsetWithoutNormalization$NA_Sales)
dataset$EU_Sales<-normalize(datsetWithoutNormalization$EU_Sales)
dataset$JP_Sales<-normalize(datsetWithoutNormalization$JP_Sales)
dataset$Other_Sales<-normalize(datsetWithoutNormalization$Other_Sales)
dataset$Global_Sales<-normalize(datsetWithoutNormalization$Global_Sales)
```
We chose min-max normalization over z-score normalization because min-max transforms the data into a fixed range (0 to 1 here), which makes it better suited to visualization and comparison. It also simplifies assessing the importance of the attributes and their contributions to the model.
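
To make the difference concrete, the sketch below applies both rescalings to a made-up vector (`toy_minmax` and the helper `min_max` are hypothetical names). Min-max maps the smallest value to 0 and the largest to 1, while the z-score centers the values on 0 with unit standard deviation and has no fixed bounds.

```{r}
# Hypothetical toy data: four sales values
toy_minmax <- c(0, 5, 10, 20)

# Min-max: (x - min) / (max - min), giving values in [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
scaled_minmax <- min_max(toy_minmax)

# Z-score: (x - mean) / sd, giving mean 0 and standard deviation 1
scaled_z <- (toy_minmax - mean(toy_minmax)) / sd(toy_minmax)

scaled_minmax
round(scaled_z, 2)
```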





# Feature selection:

Our class label (Popular) is derived from Global_Sales. Because we have sales for multiple regions, we evaluate each region's sales by its importance to the Global_Sales column, and the less important ones will be deleted from the dataset.


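Score each regional sales attribute with caret's `filterVarImp`. Because Global_Sales is numeric here, the score measures how well each predictor alone explains Global_Sales (a regression-based score); the area under the ROC curve is used only when the outcome is categorical.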
```{r}
roc_imp <- filterVarImp(x = dataset[,7:10], y = dataset$Global_Sales)
```


Sort the score in decreasing order
```{r}
roc_imp <- data.frame(cbind(variable = rownames(roc_imp), score = roc_imp[,1]))
roc_imp$score <- as.double(roc_imp$score)
roc_imp[order(roc_imp$score,decreasing = TRUE),]
```
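
To illustrate the idea of scoring one predictor at a time, the sketch below ranks two made-up regional columns by the R-squared of a single-predictor linear model. This is a base R stand-in under stated assumptions, not caret's exact computation, and `toy_regions` and its values are hypothetical.

```{r}
# Hypothetical toy data: NA_Sales tracks Global_Sales closely, JP_Sales does not
toy_regions <- data.frame(
  NA_Sales = c(1.0, 2.0, 3.0, 4.0, 5.0),
  JP_Sales = c(0.5, 0.1, 0.4, 0.2, 0.3),
  Global_Sales = c(2.1, 3.9, 6.2, 8.0, 9.9)
)

# Fit Global_Sales ~ <predictor> for each predictor and keep the R-squared
r_squared <- sapply(c("NA_Sales", "JP_Sales"), function(v) {
  fit <- lm(reformulate(v, response = "Global_Sales"), data = toy_regions)
  summary(fit)$r.squared
})

# Higher R-squared means the predictor alone explains more of Global_Sales
sort(r_squared, decreasing = TRUE)
```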


We will remove JP_Sales because it is of low importance to our class label (Global_Sales).
```{r}
# Column 9 is JP_Sales
dataset<- dataset[,-9]
```

# Dataset after pre-processing:
```{r}
print(dataset)
```

